Dataset statistics
| Number of variables | 10 |
|---|---|
| Number of observations | 2243791 |
| Missing cells | 5339117 |
| Missing cells (%) | 23.8% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 171.2 MiB |
| Average record size in memory | 80.0 B |
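The headline figures above can be cross-checked against the per-variable counts reported further down. The sketch below is only an illustration, not the profiler's own code. It assumes that the variables with no reported missing count (including the two unsupported ones) contain no missing cells, and that total memory is simply rows times the average record size. All variable names in the code are illustrative.

```python
# Figures copied from this report.
n_rows = 2243791
n_cols = 10
avg_record_bytes = 80.0

# Missing counts as listed per variable; other variables are ASSUMED to have none.
missing_per_column = {
    "Surface reelle bati": 983859,
    "Nombre pieces principales": 983859,
    "Nature culture": 617126,
    "Nature culture speciale": 2137147,
    "Surface terrain": 617126,
}

missing_cells = sum(missing_per_column.values())
missing_pct = 100 * missing_cells / (n_rows * n_cols)  # share of all cells
total_mib = n_rows * avg_record_bytes / 2**20          # bytes -> MiB

print(missing_cells)          # 5339117
print(round(missing_pct, 1))  # 23.8
print(round(total_mib, 1))    # 171.2
```

The per-variable counts add up exactly to the reported number of missing cells, which is consistent with the remaining variables being complete.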
Variable types
| NUM | 5 |
|---|---|
| CAT | 3 |
| UNSUPPORTED | 2 |
Warnings
| Warning | Type |
|---|---|
| Nature culture speciale has a high cardinality: 118 distinct values | High cardinality |
| Surface reelle bati has 983859 (43.8%) missing values | Missing |
| Nombre pieces principales has 983859 (43.8%) missing values | Missing |
| Nature culture has 617126 (27.5%) missing values | Missing |
| Nature culture speciale has 2137147 (95.2%) missing values | Missing |
| Surface terrain has 617126 (27.5%) missing values | Missing |
| Valeur fonciere is highly skewed (γ1 = 119.7062575) | Skewed |
| Surface reelle bati is highly skewed (γ1 = 184.2943118) | Skewed |
| Surface terrain is highly skewed (γ1 = 30.04295042) | Skewed |
| df_index has unique values | Unique |
| Code postal is an unsupported type, check if it needs cleaning or further analysis | Unsupported |
| Code type local is an unsupported type, check if it needs cleaning or further analysis | Unsupported |
| Surface reelle bati has 282155 (12.6%) zeros | Zeros |
| Nombre pieces principales has 368414 (16.4%) zeros | Zeros |
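The γ1 values in the skewness warnings are skewness coefficients. As a rough illustration of what such a coefficient measures, the sketch below computes the moment-based form g1 = m3 / m2^(3/2) on toy data. It is not the report's own code, and the profiler's figures may come from a bias-adjusted variant, so values on small samples can differ. The function name `skewness` is hypothetical.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no bias adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 10]  # one large value dominates the right tail

print(skewness(symmetric))     # 0.0
print(skewness(right_tailed))  # positive, roughly 1.15
```

A large positive value, as reported for Valeur fonciere, indicates a long right tail: a few very large observations pull the mean far above the median.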
Reproduction
| Analysis started | 2020-10-07 10:14:01.381535 |
|---|---|
| Analysis finished | 2020-10-07 10:15:37.595767 |
| Duration | 1 minute and 36.21 seconds |
| Software version | pandas-profiling v2.9.0 |
| Download configuration | config.yaml |
df_index
Numeric
| Distinct | 2243791 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 1240083.201 |
|---|---|
| Minimum | 0 |
| Maximum | 2535790 |
| Zeros | 1 |
| Zeros (%) | < 0.1% |
| Memory size | 17.1 MiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 121050.5 |
| Q1 | 608478.5 |
| median | 1238468 |
| Q3 | 1856156.5 |
| 95-th percentile | 2398652.5 |
| Maximum | 2535790 |
| Range | 2535790 |
| Interquartile range (IQR) | 1247678 |
Descriptive statistics
| Standard deviation | 724137.6734 |
|---|---|
| Coefficient of variation (CV) | 0.5839428134 |
| Kurtosis | -1.173666804 |
| Mean | 1240083.201 |
| Median Absolute Deviation (MAD) | 623788 |
| Skewness | 0.03194925991 |
| Sum | 2.782487526e+12 |
| Variance | 5.2437537e+11 |
| Monotonicity | Strictly increasing |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) | |
| 2047 | 1 | < 0.1% | |
| 1511499 | 1 | < 0.1% | |
| 1556561 | 1 | < 0.1% | |
| 1558608 | 1 | < 0.1% | |
| 1519695 | 1 | < 0.1% | |
| 1521742 | 1 | < 0.1% | |
| 1515597 | 1 | < 0.1% | |
| 1517644 | 1 | < 0.1% | |
| 1513546 | 1 | < 0.1% | |
| 1329184 | 1 | < 0.1% | |
| Other values (2243781) | 2243781 | > 99.9% |
| Value | Count | Frequency (%) | |
| 0 | 1 | < 0.1% | |
| 1 | 1 | < 0.1% | |
| 2 | 1 | < 0.1% | |
| 3 | 1 | < 0.1% | |
| 4 | 1 | < 0.1% |
| Value | Count | Frequency (%) | |
| 2535790 | 1 | < 0.1% | |
| 2535789 | 1 | < 0.1% | |
| 2535788 | 1 | < 0.1% | |
| 2535786 | 1 | < 0.1% | |
| 2535785 | 1 | < 0.1% |
Nature mutation
Categorical
| Distinct | 6 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 17.1 MiB |
| Value | Count | Frequency (%) | |
| Vente | 2183142 | 97.3% | |
| Echange | 26061 | 1.2% | |
| Vente en l'état futur d'achèvement | 22083 | 1.0% | |
| Vente terrain à bâtir | 7495 | 0.3% | |
| Adjudication | 4583 | 0.2% | |
| Expropriation | 427 | < 0.1% |
Frequencies of value counts
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Histogram of lengths of the category
Length
| Max length | 34 |
|---|---|
| Median length | 5 |
| Mean length | 5.377907746 |
| Min length | 5 |
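The length statistics above can be reproduced from the Nature mutation frequency table, since each category's length is weighted by its count. This is a minimal sketch under the assumption that length means the number of characters in the label (after Unicode NFC normalization). The variable names are illustrative.

```python
import unicodedata

# Counts copied from the Nature mutation frequency table.
counts = {
    "Vente": 2183142,
    "Echange": 26061,
    "Vente en l'état futur d'achèvement": 22083,
    "Vente terrain à bâtir": 7495,
    "Adjudication": 4583,
    "Expropriation": 427,
}

total = sum(counts.values())

# Character length per label; NFC keeps accented letters as single characters.
lengths = {label: len(unicodedata.normalize("NFC", label)) for label in counts}

max_length = max(lengths.values())
min_length = min(lengths.values())
mean_length = sum(lengths[label] * n for label, n in counts.items()) / total

print(total)                  # 2243791
print(max_length, min_length) # 34 5
print(round(mean_length, 6))  # 5.377908
```

Because "Vente" alone covers 97.3% of the rows, the median length is also 5.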
Valeur fonciere
Numeric
| Distinct | 99365 |
|---|---|
| Distinct (%) | 4.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 508679.8997 |
|---|---|
| Minimum | 0.01 |
| Maximum | 2086000000 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Memory size | 17.1 MiB |
Quantile statistics
| Minimum | 0.01 |
|---|---|
| 5-th percentile | 2300 |
| Q1 | 51960 |
| median | 135000 |
| Q3 | 247000 |
| 95-th percentile | 800000 |
| Maximum | 2086000000 |
| Range | 2086000000 |
| Interquartile range (IQR) | 195040 |
Descriptive statistics
| Standard deviation | 5645992.665 |
|---|---|
| Coefficient of variation (CV) | 11.09930365 |
| Kurtosis | 32045.88504 |
| Mean | 508679.8997 |
| Median Absolute Deviation (MAD) | 93000 |
| Skewness | 119.7062575 |
| Sum | 1.141371381e+12 |
| Variance | 3.187723317e+13 |
| Monotonicity | Not monotonic |
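Several of the descriptive statistics above are derived from one another. The sketch below recomputes the coefficient of variation, the variance and the interquartile range from the reported mean, standard deviation and quartiles, using the standard definitions (CV = standard deviation / mean, variance = standard deviation squared, IQR = Q3 - Q1). It is an illustration, not the profiler's implementation.

```python
# Figures copied from the tables above.
mean = 508679.8997
std = 5645992.665
q1, q3 = 51960, 247000

cv = std / mean       # coefficient of variation
variance = std ** 2   # variance is the squared standard deviation
iqr = q3 - q1         # interquartile range

print(round(cv, 4))       # 11.0993
print(f"{variance:.4e}")  # 3.1877e+13
print(iqr)                # 195040
```

The mean (508679.8997) is far above the median (135000), which is consistent with the strong right skew flagged for this variable.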
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) | |
| 100000 | 20961 | 0.9% | |
| 150000 | 20337 | 0.9% | |
| 120000 | 19012 | 0.8% | |
| 80000 | 18028 | 0.8% | |
| 130000 | 17095 | 0.8% | |
| 110000 | 16921 | 0.8% | |
| 50000 | 16668 | 0.7% | |
| 140000 | 16336 | 0.7% | |
| 1 | 16270 | 0.7% | |
| 200000 | 16141 | 0.7% | |
| Other values (99355) | 2066022 | 92.1% |
| Value | Count | Frequency (%) | |
| 0.01 | 2 | < 0.1% | |
| 0.15 | 122 | < 0.1% | |
| 0.16 | 3 | < 0.1% | |
| 0.18 | 7 | < 0.1% | |
| 0.19 | 2 | < 0.1% |
| Value | Count | Frequency (%) | |
| 2086000000 | 2 | < 0.1% | |
| 1750000000 | 3 | < 0.1% | |
| 690186750 | 24 | < 0.1% | |
| 612990460 | 6 | < 0.1% | |
| 400000000 | 1 | < 0.1% |
Surface reelle bati
Numeric
| Distinct | 3835 |
|---|---|
| Distinct (%) | 0.3% |
| Missing | 983859 |
| Missing (%) | 43.8% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 90.83265287 |
|---|---|
| Minimum | 0 |
| Maximum | 312962 |
| Zeros | 282155 |
| Zeros (%) | 12.6% |
| Memory size | 17.1 MiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 0 |
| Q1 | 20 |
| median | 63 |
| Q3 | 96 |
| 95-th percentile | 174 |
| Maximum | 312962 |
| Range | 312962 |
| Interquartile range (IQR) | 76 |
Descriptive statistics
| Standard deviation | 900.2078557 |
|---|---|
| Coefficient of variation (CV) | 9.910619444 |
| Kurtosis | 47314.29444 |
| Mean | 90.83265287 |
| Median Absolute Deviation (MAD) | 37 |
| Skewness | 184.2943118 |
| Sum | 114442966 |
| Variance | 810374.1835 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) | |
| 0 | 282155 | 12.6% | |
| 80 | 19661 | 0.9% | |
| 90 | 17823 | 0.8% | |
| 60 | 17756 | 0.8% | |
| 70 | 17389 | 0.8% | |
| 100 | 15331 | 0.7% | |
| 50 | 14415 | 0.6% | |
| 65 | 13056 | 0.6% | |
| 40 | 12673 | 0.6% | |
| 75 | 11967 | 0.5% | |
| Other values (3825) | 837706 | 37.3% | |
| (Missing) | 983859 | 43.8% |
| Value | Count | Frequency (%) | |
| 0 | 282155 | 12.6% | |
| 1 | 260 | < 0.1% | |
| 2 | 195 | < 0.1% | |
| 3 | 197 | < 0.1% | |
| 4 | 132 | < 0.1% |
| Value | Count | Frequency (%) | |
| 312962 | 2 | < 0.1% | |
| 240000 | 2 | < 0.1% | |
| 215000 | 2 | < 0.1% | |
| 212120 | 2 | < 0.1% | |
| 152856 | 6 | < 0.1% |
Nombre pieces principales
Numeric
| Distinct | 40 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 983859 |
| Missing (%) | 43.8% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 2.500972275 |
|---|---|
| Minimum | 0 |
| Maximum | 67 |
| Zeros | 368414 |
| Zeros (%) | 16.4% |
| Memory size | 17.1 MiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 0 |
| Q1 | 0 |
| median | 3 |
| Q3 | 4 |
| 95-th percentile | 6 |
| Maximum | 67 |
| Range | 67 |
| Interquartile range (IQR) | 4 |
Descriptive statistics
| Standard deviation | 2.099390297 |
|---|---|
| Coefficient of variation (CV) | 0.8394296565 |
| Kurtosis | 2.720708602 |
| Mean | 2.500972275 |
| Median Absolute Deviation (MAD) | 2 |
| Skewness | 0.5417265369 |
| Sum | 3151055 |
| Variance | 4.407439621 |
| Monotonicity | Not monotonic |
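The figures above are consistent with the mean being taken over non-missing values only, while the missing and zero percentages are taken over all rows. The sketch below checks this arithmetic. That the profiler skips missing values when computing the sum and mean is an assumption here, supported only by the numbers agreeing.

```python
# Figures copied from the tables above.
n_rows = 2243791
n_missing = 983859
n_zeros = 368414
total_sum = 3151055

n_present = n_rows - n_missing          # rows that actually carry a value
mean = total_sum / n_present            # mean over non-missing values only
missing_pct = 100 * n_missing / n_rows  # percentages use all rows
zeros_pct = 100 * n_zeros / n_rows

print(n_present)              # 1259932
print(round(mean, 6))         # 2.500972
print(round(missing_pct, 1))  # 43.8
print(round(zeros_pct, 1))    # 16.4
```

Among the non-missing rows, zeros make up roughly 29% (368414 of 1259932), which explains why Q1 is 0.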
Histogram with fixed size bins (bins=40)
| Value | Count | Frequency (%) | |
| 0 | 368414 | 16.4% | |
| 4 | 220121 | 9.8% | |
| 3 | 210343 | 9.4% | |
| 2 | 152819 | 6.8% | |
| 5 | 135306 | 6.0% | |
| 1 | 87239 | 3.9% | |
| 6 | 53617 | 2.4% | |
| 7 | 19669 | 0.9% | |
| 8 | 7247 | 0.3% | |
| 9 | 2706 | 0.1% | |
| Other values (30) | 2451 | 0.1% | |
| (Missing) | 983859 | 43.8% |
| Value | Count | Frequency (%) | |
| 0 | 368414 | 16.4% | |
| 1 | 87239 | 3.9% | |
| 2 | 152819 | 6.8% | |
| 3 | 210343 | 9.4% | |
| 4 | 220121 | 9.8% |
| Value | Count | Frequency (%) | |
| 67 | 1 | < 0.1% | |
| 56 | 1 | < 0.1% | |
| 54 | 1 | < 0.1% | |
| 53 | 2 | < 0.1% | |
| 50 | 2 | < 0.1% |
Nature culture
Categorical
| Distinct | 27 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 617126 |
| Missing (%) | 27.5% |
| Memory size | 17.1 MiB |
| Value | Count | Frequency (%) | |
| S | 753946 | 33.6% | |
| T | 247849 | 11.0% | |
| P | 125548 | 5.6% | |
| J | 91931 | 4.1% | |
| AB | 85966 | 3.8% | |
| BT | 73779 | 3.3% | |
| L | 62130 | 2.8% | |
| AG | 58788 | 2.6% | |
| VI | 29115 | 1.3% | |
| BR | 24579 | 1.1% | |
| Other values (17) | 73034 | 3.3% | |
| (Missing) | 617126 | 27.5% |
Frequencies of value counts
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Histogram of lengths of the category
Length
| Max length | 3 |
|---|---|
| Median length | 1 |
| Mean length | 1.697312718 |
| Min length | 1 |
Nature culture speciale
Categorical
| Distinct | 118 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 2137147 |
| Missing (%) | 95.2% |
| Memory size | 17.1 MiB |
| Value | Count | Frequency (%) | |
| POTAG | 26810 | 1.2% | |
| PIN | 9875 | 0.4% | |
| PATUR | 9677 | 0.4% | |
| PARC | 9571 | 0.4% | |
| FRICH | 6208 | 0.3% | |
| VAOC | 5118 | 0.2% | |
| CHAT | 3632 | 0.2% | |
| CHENE | 2630 | 0.1% | |
| PACAG | 2471 | 0.1% | |
| MARAI | 2449 | 0.1% | |
| Other values (108) | 28203 | 1.3% | |
| (Missing) | 2137147 | 95.2% |
Frequencies of value counts
Unique
| Unique | 6 |
|---|---|
| Unique (%) | < 0.1% |
Histogram of lengths of the category
Length
| Max length | 5 |
|---|---|
| Median length | 3 |
| Mean length | 3.071866319 |
| Min length | 3 |
Surface terrain
Numeric
| Distinct | 40949 |
|---|---|
| Distinct (%) | 2.5% |
| Missing | 617126 |
| Missing (%) | 27.5% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 2802.197914 |
|---|---|
| Minimum | 0 |
| Maximum | 1662560 |
| Zeros | 56 |
| Zeros (%) | < 0.1% |
| Memory size | 17.1 MiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 30 |
| Q1 | 226 |
| median | 593 |
| Q3 | 1723 |
| 95-th percentile | 11710 |
| Maximum | 1662560 |
| Range | 1662560 |
| Interquartile range (IQR) | 1497 |
Descriptive statistics
| Standard deviation | 10642.79269 |
|---|---|
| Coefficient of variation (CV) | 3.798016063 |
| Kurtosis | 2411.94072 |
| Mean | 2802.197914 |
| Median Absolute Deviation (MAD) | 466 |
| Skewness | 30.04295042 |
| Sum | 4558237270 |
| Variance | 113269036.3 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) | |
| 500 | 33216 | 1.5% | |
| 1000 | 14981 | 0.7% | |
| 800 | 4838 | 0.2% | |
| 600 | 4802 | 0.2% | |
| 12 | 4209 | 0.2% | |
| 400 | 4085 | 0.2% | |
| 700 | 3924 | 0.2% | |
| 13 | 3825 | 0.2% | |
| 200 | 3807 | 0.2% | |
| 100 | 3753 | 0.2% | |
| Other values (40939) | 1545225 | 68.9% | |
| (Missing) | 617126 | 27.5% |
| Value | Count | Frequency (%) | |
| 0 | 56 | < 0.1% | |
| 1 | 3521 | 0.2% | |
| 2 | 2840 | 0.1% | |
| 3 | 2548 | 0.1% | |
| 4 | 2653 | 0.1% |
| Value | Count | Frequency (%) | |
| 1662560 | 1 | < 0.1% | |
| 1420388 | 1 | < 0.1% | |
| 1411524 | 4 | < 0.1% | |
| 1250223 | 1 | < 0.1% | |
| 1187767 | 1 | < 0.1% |
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear relationship the steepness of the line does not affect r (only the sign of its slope does). To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
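The calculation described above can be sketched in a few lines of plain Python. This is an illustrative implementation with a hypothetical function name (`pearson_r`), not the code the profiler runs. The 1/n factors in the covariance and the standard deviations cancel, so they are omitted.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                   # exact linear function of x
shifted = [3 * v + 7 for v in y]       # change of location and scale
cubes = [v ** 3 for v in x]            # monotonic but not linear

print(pearson_r(x, y))        # 1.0
print(pearson_r(x, shifted))  # still 1.0 up to rounding
print(pearson_r(x, cubes))    # below 1: the relation is not linear
```

The second call illustrates the invariance under location and scale changes, and the third shows that a perfectly monotonic but non-linear relation yields r below 1.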
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
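A minimal sketch of that procedure: replace each value by its rank, then apply the Pearson formula to the ranks. The helper names (`average_ranks`, `spearman_rho`) are hypothetical, and giving tied values their average rank is an assumption about tie handling rather than something the text above specifies.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
cubes = [v ** 3 for v in x]  # nonlinear but strictly increasing

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
print(spearman_rho(x, cubes))           # 1.0 up to rounding
```

Because only the ordering matters, the strictly increasing cubic relation reaches ρ = 1 even though it is not linear.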
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
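The pair-counting definition above translates directly into code. The sketch below implements exactly that formula (often called tau-a), with a hypothetical function name. Library implementations commonly apply a tie correction, so their results can differ when the data contain ties.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs, with no tie correction."""
    concordant = 0
    discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1  # both variables move in the same direction
        elif direction < 0:
            discordant += 1  # the variables move in opposite directions
        # direction == 0: tied in x or y, counted as neither
    return (concordant - discordant) / len(pairs)

x = [1, 2, 3, 4]
y = [1, 3, 2, 4]  # one swapped pair out of six

print(kendall_tau_a(x, x))  # 1.0
print(kendall_tau_a(x, y))  # (5 - 1) / 6, about 0.667
```

This brute-force version visits every pair, so it is quadratic in the number of observations and only suitable for small samples.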
Phik (φk)
Phik (φk) is a practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.
Cramér's V (φc)
Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
First rows
| | df_index | Nature mutation | Valeur fonciere | Code postal | Code type local | Surface reelle bati | Nombre pieces principales | Nature culture | Nature culture speciale | Surface terrain |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Vente | 37220.0 | 1000 | 2 | 20.0 | 1.0 | NaN | NaN | NaN |
| 1 | 1 | Vente | 185100.0 | 1000 | 2 | 62.0 | 3.0 | NaN | NaN | NaN |
| 2 | 2 | Vente | 185100.0 | 1000 | 3 | 0.0 | 0.0 | NaN | NaN | NaN |
| 3 | 3 | Vente | 209000.0 | 1160 | 1 | 90.0 | 4.0 | S | NaN | 940.0 |
| 4 | 4 | Vente | 134900.0 | 1370 | 1 | 101.0 | 5.0 | S | NaN | 490.0 |
| 5 | 5 | Vente | 192000.0 | 1340 | 1 | 88.0 | 4.0 | S | NaN | 708.0 |
| 6 | 6 | Vente | 45000.0 | 1250 | 1 | 39.0 | 2.0 | S | NaN | 631.0 |
| 7 | 7 | Vente | 45000.0 | 1250 | [5.0] | NaN | NaN | L | NaN | 120.0 |
| 8 | 8 | Vente | 65000.0 | 1000 | 3 | 0.0 | 0.0 | NaN | NaN | NaN |
| 9 | 9 | Vente | 65000.0 | 1000 | 2 | 69.0 | 3.0 | NaN | NaN | NaN |
Last rows
| | df_index | Nature mutation | Valeur fonciere | Code postal | Code type local | Surface reelle bati | Nombre pieces principales | Nature culture | Nature culture speciale | Surface terrain |
|---|---|---|---|---|---|---|---|---|---|---|
| 2243781 | 2535779 | Vente | 17521000.0 | 75004 | 2 | 100.0 | 4.0 | S | NaN | 470.0 |
| 2243782 | 2535780 | Vente | 17521000.0 | 75004 | 2 | 61.0 | 4.0 | S | NaN | 470.0 |
| 2243783 | 2535782 | Vente | 17521000.0 | 75004 | 2 | 70.0 | 3.0 | S | NaN | 470.0 |
| 2243784 | 2535783 | Vente | 17521000.0 | 75004 | 2 | 47.0 | 1.0 | S | NaN | 470.0 |
| 2243785 | 2535784 | Vente | 17521000.0 | 75004 | 2 | 55.0 | 2.0 | S | NaN | 470.0 |
| 2243786 | 2535785 | Vente | 17521000.0 | 75004 | 2 | 66.0 | 4.0 | S | NaN | 470.0 |
| 2243787 | 2535786 | Vente | 17521000.0 | 75004 | 2 | 120.0 | 5.0 | S | NaN | 470.0 |
| 2243788 | 2535788 | Adjudication | 610000.0 | 75004 | 2 | 44.0 | 2.0 | NaN | NaN | NaN |
| 2243789 | 2535789 | Vente | 1400000.0 | 75002 | 4 | 100.0 | 0.0 | NaN | NaN | NaN |
| 2243790 | 2535790 | Vente | 1400000.0 | 75002 | 2 | 97.0 | 3.0 | NaN | NaN | NaN |